Ruimin Wang (wrmkeer@ucla.edu) Feature Extraction, Text Mining, Feature Visualization
Sha Li (shali@ucla.edu) Feature Test, Gradient Boosting & Analysis, Documentation
Jun Ma (mj.saber1990@gmail.com) Data Processing, Random Forest & Analysis, Documentation
from IPython.display import Image
Image(filename='/Users/shali/Documents/CS249/Midterm/CS249_Project/wordCloud.JPG',
embed=True)
3.1 Data Source
3.2 Data Preprocessing
4.1 Feature Exploration Before Modeling
5.1 Gradient Boost (Multiclass vs. Two-layer)
5.1.1 Without Text Features
5.1.2 With Text Features
5.2 Random Forest (Multiclass vs. Two-layer)
5.2.1 Without Text Features
5.2.2 With Text Features
%load_ext rmagic
import rpy2 as Rpy
%%R
not.installed = function(package_name) !is.element(package_name, installed.packages()[,1])
if (not.installed("tm")) install.packages("tm")
if (not.installed("topicmodels")) install.packages("topicmodels")
if (not.installed("slam")) install.packages("slam")
library(tm)
library(topicmodels)
library(slam)
library(gbm)
library(randomForest)
library(plyr)
library(MASS)
Reviews on Yelp are helpful for customers when choosing stores or restaurants. Customers can get information from reviews and choose the store that best satisfies their requirements. In general, newly posted reviews may be more valuable than old ones, since they contain up-to-date information about the stores. However, because they are newly posted, they have not yet received many useful votes to reflect their usefulness. Our goal is to identify the reviews that might be informative to users.
We decided to label the reviews 'High', 'Medium', or 'Low' to distinguish their value to users. Since only a few reviews for a given business are posted every week, picking out the fresh reviews that are potentially highly useful and placing them at the front of the review list is good in practice.
Our project is to analyze the usefulness of reviews from the Yelp Academic dataset, and predict whether a newly posted review is useful or not.
We set up our analysis as a classification problem. As mentioned in the 'Motivation' section, only a few reviews are posted for a given business every week, so it is more practical to determine whether a review is useful than to predict how many useful votes it will receive. To label reviews by their usefulness, we first looked into the 'votes_useful' attribute in the review data. By plotting its percentile vs. number-of-votes curve, we found that 46% of the reviews received no useful votes, about 50% received one to four useful votes, and fewer than 5% were voted useful more than five times. Since useful reviews are uncommon, we labeled reviews with five or more useful votes as 'High', meaning highly useful. Moreover, we should not ignore reviews that received some useful votes but are not highly useful: they may be useful to some people, but not to most customers. We labeled these 'Medium'. Reviews with zero useful votes were labeled 'Low'.
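As a concrete sketch, the labeling rule described above amounts to a simple threshold function (a minimal Python illustration; the function name is ours):

```python
def usefulness_label(votes_useful):
    """Map a review's useful-vote count to a class label.

    Thresholds follow the percentile analysis above: zero useful votes
    is 'Low', five or more is 'High', everything in between is 'Medium'.
    """
    if votes_useful == 0:
        return "Low"
    elif votes_useful >= 5:
        return "High"
    else:
        return "Medium"

labels = [usefulness_label(v) for v in [0, 2, 5, 12]]
# labels == ["Low", "Medium", "High", "High"]
```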
We further explored the data grouped by category. Since reviews differ across categories, e.g., in word usage, we decided to analyze the reviews by category. We chose four categories, 'Restaurants', 'Nightlife', 'Shopping' and 'Food', each of which contains at least 20 thousand reviews.
The raw data does not present useful information directly, so we derived features from it. In the user dataset, each user has three important attributes: 'votes_useful', 'votes_cool', and 'votes_funny', the total votes the user received over all of his previous reviews. We used the per-review averages of these votes as three features. From the review dataset, we extracted both text and non-text features.
For non-text features, we used ddply to group and aggregate the data. For example, for each user, we extracted statistics of useful, funny, and cool votes and stars, e.g., the median number of cool votes the user received over all of his reviews in the review dataset. Although these features are user-wise, they are not present in the user dataset. We also extracted statistics of useful, funny, and cool votes and stars for each individual business from the review dataset, which are not given in the business dataset.
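The group-and-aggregate step that ddply performs can be sketched language-agnostically. The following minimal Python illustration (with toy records; the feature names mirror ours, but the data is hypothetical) shows the same idea of grouping reviews by user and computing vote statistics:

```python
from collections import defaultdict
from statistics import mean, median

# toy review records: (user_id, votes_useful)
reviews = [
    ("u1", 3), ("u1", 0), ("u1", 5),
    ("u2", 1), ("u2", 1),
]

# group the reviews by user, as ddply(reviews, .(user_id), ...) does
by_user = defaultdict(list)
for user_id, votes in reviews:
    by_user[user_id].append(votes)

# aggregate per-user vote statistics into user-wise features
user_features = {
    uid: {"u.avg_useful": mean(v), "u.median_useful": median(v),
          "u.max_useful": max(v), "u.min_useful": min(v)}
    for uid, v in by_user.items()
}
```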
For text features, we first observed some apparent characteristics present in the text and used regular expressions to acquire these features, e.g., the number of smileys in the text. Then we decided to delve into the text. We first inspected word usage and constructed the TF-IDF matrix, based on which we computed pair-wise document similarities. With the similarity matrix, we clustered the data and used the cluster index as a feature in our dataset. We then studied the grammar in the text. We initially tried to investigate sentence structure, but since it varies a great deal across reviews, we could not find a single structure that fits all reviews. We therefore counted the number of adjectives in each review and added it to our feature set.
In the business dataset, we calculated the difference between each review rating and the business average star rating. Furthermore, we included some information from check-in dataset, i.e. num_reviews/num_checkin for each business.
We relied primarily on two classification methods: Gradient Boosting and Random Forest. In the previous work we surveyed, gradient boosting and random forests show the best results among comparable methods. In addition, ensemble-learning methods give better results when the individual weak classifiers are not favorable for classification.
Since our problem is multi-class, we first used a One-vs.-All approach. Because the results were not as good as we expected, we then used a two-layer approach to improve accuracy. In each layer we divided the data into two groups and performed binary classification. In the first layer, we trained a binary classifier to separate 'Low' from 'Medium' and 'High'. In the second layer, we kept only the data in the 'Medium' and 'High' classes and trained another model to distinguish these two classes. This method improved our test accuracy.
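The two-layer control flow can be sketched as follows. This is a toy Python illustration, not our actual pipeline: the threshold "classifiers" are stand-ins (assuming the positive class lies above the negative class on a single feature), and any real binary classifier, e.g., gradient boosting or a random forest, would be plugged in at both layers:

```python
def train_threshold(xs, ys, positive):
    """Toy binary 'classifier training': midpoint threshold between the
    smallest positive-class value and the largest negative-class value
    (assumes the classes are separable on this one feature)."""
    pos = [x for x, y in zip(xs, ys) if y == positive]
    neg = [x for x, y in zip(xs, ys) if y != positive]
    return (min(pos) + max(neg)) / 2.0

def two_layer_predict(x, t1, t2):
    if x < t1:                  # layer 1: 'Low' vs. {'Medium', 'High'}
        return "Low"
    return "High" if x >= t2 else "Medium"   # layer 2: 'Medium' vs. 'High'

# toy 1-D data: the feature value roughly tracks usefulness
xs = [0.1, 0.2, 0.3, 1.0, 1.2, 1.4, 3.0, 3.5]
ys = ["Low", "Low", "Low", "Medium", "Medium", "Medium", "High", "High"]

# layer 1 trains on all data with the labels collapsed to Low / rest
t1 = train_threshold(xs, ["Low" if y == "Low" else "rest" for y in ys], "rest")
# layer 2 trains only on the Medium/High subset
mh = [(x, y) for x, y in zip(xs, ys) if y != "Low"]
t2 = train_threshold([x for x, _ in mh], [y for _, y in mh], "High")

preds = [two_layer_predict(x, t1, t2) for x in xs]
```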
Yelp Academic Dataset from the greater Phoenix, AZ metropolitan area including:
15,585 businesses
111,561 business attributes
11,434 check-in sets
70,817 users
151,516 edge social graph
113,993 tips
335,022 reviews
business {
  'type': 'business',
  'business_id': (encrypted business id),
  'name': (business name),
  'neighborhoods': [(hood names)],
  'full_address': (localized address),
  'city': (city),
  'state': (state),
  'latitude': latitude,
  'longitude': longitude,
  'stars': (star rating, rounded to half-stars),
  'review_count': review count,
  'categories': [(localized category names)]
  'open': True / False (corresponds to closed, not business hours),
  'hours': { (day_of_week): { 'open': (HH:MM), 'close': (HH:MM) }, ... },
  'attributes': { (attribute_name): (attribute_value), ... },
}
review {
  'type': 'review',
  'business_id': (encrypted business id),
  'user_id': (encrypted user id),
  'stars': (star rating, rounded to half-stars),
  'text': (review text),
  'date': (date, formatted like '2012-03-14'),
  'votes': {(vote type): (count)},
}
user {
  'type': 'user',
  'user_id': (encrypted user id),
  'name': (first name),
  'review_count': (review count),
  'average_stars': (floating point average, like 4.31),
  'votes': {(vote type): (count)},
  'friends': [(friend user_ids)],
  'elite': [(years_elite)],
  'yelping_since': (date, formatted like '2012-03'),
  'compliments': { (compliment_type): (num_compliments_of_this_type), ... },
  'fans': (num_fans),
}
check-in {
  'type': 'checkin',
  'business_id': (encrypted business id),
  'checkin_info': {
    '0-0': (number of checkins from 00:00 to 01:00 on all Sundays),
    '1-0': (number of checkins from 01:00 to 02:00 on all Sundays),
    ...
    '14-4': (number of checkins from 14:00 to 15:00 on all Thursdays),
    ...
    '23-6': (number of checkins from 23:00 to 00:00 on all Saturdays)
  },  # if there was no checkin for a hour-day block it will not be in the dict
}
tip {
  'type': 'tip',
  'text': (tip text),
  'business_id': (encrypted business id),
  'user_id': (encrypted user id),
  'date': (date, formatted like '2012-03-14'),
  'likes': (count),
}
%%R
reviews <- read.csv(file = "~/Desktop/Yelp/yelp_academic_dataset_review.csv")
%%R
usefulVotesMoreThanOne <- reviews$votes_useful#[reviews$votes_useful>1]
distr.fit = fitdistr(usefulVotesMoreThanOne,'exponential')
hist((usefulVotesMoreThanOne), freq=FALSE, breaks=50,xlim = c(0,50), col = "blue",border = NA)
curve( dexp(x,rate=distr.fit$estimate[1]), col="red", add=TRUE )
%%R
#print(usefulCount)
usefulPercentile <- ddply(reviews, .(votes_useful), summarize,
percentile = sum(reviews$votes_useful<votes_useful)/length(reviews$votes_useful))
plot(usefulPercentile$votes_useful, usefulPercentile$percentile, type="l")
points(usefulPercentile$votes_useful[c(2,6,11)], usefulPercentile$percentile[c(2,6,11)], pch=20,col = "blue")
text( usefulPercentile$votes_useful[c(2,6,11)], usefulPercentile$percentile[c(2,6,11)],
c(sprintf("(0,%.3f)",usefulPercentile$percentile[2]),
sprintf("(5,%.3f)",usefulPercentile$percentile[6]),
sprintf("(10,%.3f)",usefulPercentile$percentile[11])
), data = usefulPercentile[c(2,6,11),],pos=4, cex=1, col="blue" )
%%R
business <- read.csv(file = "~/Desktop/Yelp/yelp_academic_dataset_business.csv")
%%R
#print(head(business))
if (exists("categoryIndex")){
rm(categoryIndex)
}
BigCategories<- c("Restaurants","Nightlife","Shopping","Food","Health & Medical","Beauty & Spas",
"Home Services","Local Services","Event Planning & Services","Arts & Entertainment",
"Active Life","Professional Services","Automotive","Hotels & Travel","Education","Real Estate",
"Pets","Financial Services","Local Flavor","Public Services & Government","Mass Media","Religious Organization")
#categoryIndex<-data.frame()
for(category in BigCategories){
categoryname<-gsub(" ","",category)
categoryname<-gsub("&","",categoryname)
if (!exists("categoryIndex")){
Express<- sprintf("%s<-grepl(category,business$categories,ignore.case = TRUE)
categoryIndex<-data.frame(%s)",categoryname,categoryname)
}
else{
Express<- sprintf("categoryIndex$%s <- grepl(category,business$categories,ignore.case = TRUE)",categoryname)
}
eval(parse(text = Express))
}
print(colSums(categoryIndex))
%%R
reviewsForCategories <- list()
for(category in colnames(categoryIndex)){
Express<- sprintf("reviewsForCategories$%s <- subset(reviews,business_id %%in%%
business$business_id[categoryIndex$%s])",
category,category)
eval(parse(text = Express))
}
%%R
## number of reviews for each category
categoryNames<-names(reviewsForCategories)
numberOfReviews<-c()
for(i in 1:length(reviewsForCategories)){
cat(categoryNames[i],": ",nrow(reviewsForCategories[[categoryNames[i]]]),"\n")
numberOfReviews <- c(numberOfReviews,nrow(reviewsForCategories[[categoryNames[i]]]))
}
%%R
numberOfBusinessInCategory<-colSums(categoryIndex)
plot(log(numberOfBusinessInCategory),log(numberOfReviews),col = 2+2*(numberOfReviews>20000))
fit.lm <- lm(log(numberOfReviews)~log(numberOfBusinessInCategory))
abline(fit.lm, untf = FALSE)
mtext("approximately satisfies power law")
legend(3.5,12, c("Categories of interest"), col = c(4),
text.col = 4, pch = c(1),
merge = TRUE)
%%R
## reviews for Restaurants
reviewsRestaurants <- reviewsForCategories$Restaurants
print(nrow(reviewsRestaurants))
## reivews for Food
reviewsFood <- reviewsForCategories$Food
print(nrow(reviewsFood))
## reviews for Nightlife
reviewsNightlife <- reviewsForCategories$Nightlife
print(nrow(reviewsNightlife))
## reviews for Shopping
reviewsShopping <- reviewsForCategories$Shopping
print(nrow(reviewsShopping))
%%R -w 800 -h 800
opar <- par(mfrow = c(2,2), oma = c(0, 0, 1.1, 0))
hist(reviewsRestaurants$votes_useful, breaks=50,xlim = c(0,50), col = "blue",border = NA)
hist(reviewsFood$votes_useful, breaks=50,xlim = c(0,50), col = "blue",border = NA)
hist(reviewsNightlife$votes_useful, breaks=50,xlim = c(0,50), col = "blue",border = NA)
hist(reviewsShopping$votes_useful, breaks=50,xlim = c(0,50), col = "blue",border = NA)
par(opar)
%%R
print(colnames(reviewsRestaurants))
%%R
#filesPath <- "/Users/shali/Downloads/CS249/Midterm/CS249_Project/YelpCsv"
#corpus <- Corpus(DirSource(filesPath,encoding = "UTF-8") , readerControl = list(lange="english"))
#VectorSource
corpus_Shopping <- Corpus(VectorSource(reviewsShopping$text) , readerControl = list(language="english"))
print(corpus_Shopping)
%%R
stopw <- c("hes","and", "one", "just", "will", "the", "can")
corpus_Shopping <- tm_map(corpus_Shopping, removePunctuation)
corpus_Shopping <- tm_map(corpus_Shopping, tolower)
corpus_Shopping <- tm_map(corpus_Shopping, removeWords,c(stopw,stopwords("english")))
%%R
dtm_Shopping <- DocumentTermMatrix(corpus_Shopping)
print(dim(dtm_Shopping))
print(dtm_Shopping)
%%R
## drop documents whose terms were all removed (row sum of the DTM is zero)
num_elements <- as.array(rollup(dtm_Shopping, 2, na.rm=TRUE, FUN=sum))
delete_index <- which(num_elements == 0)
if(length(delete_index)>0){
  dtm_Shopping <- dtm_Shopping[-delete_index,]
}
%%R
k <- 5
lda_result <- LDA(dtm_Shopping, k, method="VEM")
terms_lda<-terms(lda_result, 20)
filew <- file("lda_5_Shopping_topic.txt",open="wt")
for(i in 1:length(terms_lda[1,]))
{
head = sprintf("Topic %d:",i)
writeLines(paste(c(head,terms_lda[,i]),collapse=" "),filew)
}
close(filew)
Topic 1: like store get time dont day selection home well people place right make always location good service another need went
Topic 2: store place can great find good back well like selection location items dont love see always nice friendly staff customer
Topic 3: like also find ive store never can time place said even good now love service going make dont items great
Topic 4: great store back even really place get buy good going since shop everything got much try prices staff nice like
Topic 5: get really great time always shopping find shop new place stores love like went little prices nice need dont also
%%R
print(colnames(reviews))
print(head(reviews$date))
dates <- as.Date(reviews$date, "%Y-%m-%d")
numberOfReviewsByYear = rep(0, length(2005:2014))
for(i in 2005:2014){
reviews$year[(dates>=(as.Date(sprintf("%d-01-01",i))))&(dates<=(as.Date (sprintf("%d-12-31",i))))]<-i
}
for(i in 2005:2014){
numberOfReviewsByYear[i-2004] <- sum(reviews$year==i)
}
names(numberOfReviewsByYear) <- NULL
mp <- barplot(numberOfReviewsByYear, col = "blue")
axis(1, at = mp, labels = 2005:2014, cex.axis = 1)
%%R -w 1000
opar <- par(mfrow = c(2,5), oma = c(0, 0, 1.1, 0))
for(i in 2005:2014){
usefulVotes <- reviews$votes_useful[reviews$year == i]
hist((usefulVotes), freq=FALSE, col = "blue",border = NA)
mtext(sprintf("Year = %d",i))
}
par(opar)
%%R -w 1100
usefulPercentile <- ddply(reviews, .(votes_useful,year), summarize,
percentile = sum(reviews$votes_useful[reviews$year==year]<votes_useful)/
length(reviews$votes_useful[reviews$year==year]))
opar <- par(mfrow = c(2,5), oma = c(0, 0, 1.1, 0))
for(i in 2005:2014){
plot(usefulPercentile$votes_useful[usefulPercentile$year==i], usefulPercentile$percentile[usefulPercentile$year==i],
type="l",xlab = "votes_useful",ylab = "Percentile")
points(usefulPercentile$votes_useful[usefulPercentile$votes_useful %in% c(1,5,10) & usefulPercentile$year==i],
usefulPercentile$percentile[usefulPercentile$votes_useful %in% c(1,5,10) & usefulPercentile$year==i],
pch=20,col = "blue")
text( usefulPercentile$votes_useful[usefulPercentile$votes_useful %in% c(1,5,10) & usefulPercentile$year==i],
usefulPercentile$percentile[usefulPercentile$votes_useful %in% c(1,5,10) & usefulPercentile$year==i],
c(sprintf("(0,%.3f)",usefulPercentile$percentile[usefulPercentile$votes_useful==1 & usefulPercentile$year==i]),
sprintf("(5,%.3f)",usefulPercentile$percentile[usefulPercentile$votes_useful==5 & usefulPercentile$year==i]),
sprintf("(10,%.3f)",usefulPercentile$percentile[usefulPercentile$votes_useful==10 & usefulPercentile$year==i])
),
data = usefulPercentile[usefulPercentile$votes_useful %in% c(1,5,10) & usefulPercentile$year==i,],
pos=4, cex=1, col="blue" )
}
par(opar)
number of paragraphs, number of words, number of URLs, number of numbers, number of capitals; special marks: number of smileys, quotation marks, exclamations, ellipses, brackets, asterisks, colons, dollar signs
average, maximum, minimum, median, and standard deviation of useful, cool, and funny votes and stars for each user (business)
number of check-ins, (number of reviews)/(number of check-ins)
review text clustering labels based on text similarity over word stems
number of adjectives
(58 features in total)
NOTE: The reviews of the Shopping category are used for demonstration. We split the whole dataset by date into training and test sets at a size ratio of approximately 10:1.
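A chronological split of this kind can be sketched as follows (the helper and toy records are hypothetical; the real split uses the review 'date' field in the same way):

```python
from datetime import date

def split_by_date(reviews, test_fraction=1.0 / 11):
    """Sort reviews chronologically and hold out the most recent slice
    as the test set, here ~1/11, giving roughly a 10:1 train:test ratio."""
    ordered = sorted(reviews, key=lambda r: r["date"])
    n_test = max(1, int(round(len(ordered) * test_fraction)))
    return ordered[:-n_test], ordered[-n_test:]

# toy records with a 'date' field, mirroring the review data
reviews = [{"review_id": i, "date": date(2012, 1 + i % 12, 1)} for i in range(22)]
train, test = split_by_date(reviews)
# len(train) == 20, len(test) == 2, and every test review is newer than the train set
```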
%%R
reviews <- read.csv(file = "YelpCsv/yelp_review_test.csv")
users <- read.csv(file = "YelpCsv/yelp_academic_dataset_user.csv")
business <- read.csv(file = "YelpCsv/yelp_academic_dataset_business.csv")
checkin <- read.csv(file = "YelpCsv/yelp_academic_dataset_checkin.csv")
checkin[is.na(checkin)] = 0
business = business[,-which(colnames(business)=='type')]
checkin = checkin[,-which(colnames(checkin)=='type')]
library(plyr)
##################################################################
# handling review data
##################################################################
# append year to reivew
dates <- as.Date(reviews$date, "%Y-%m-%d")
for(i in 2005:2014){
reviews$year[(dates>=(as.Date(sprintf("%d-01-01",i))))&(dates<=(as.Date (sprintf("%d-12-31",i))))]<-i
}
# extract features from text
len = nrow(reviews)
reviews$num_words = rep(0,len)
reviews$num_smiley = rep(0,len)
reviews$num_url = rep(0,len)
reviews$num_nums = rep(0,len)
reviews$num_capital = rep(0,len)
reviews$num_paragraph = rep(0,len)
reviews$num_exclamation = rep(0,len)
reviews$num_quotationPair = rep(0,len)
reviews$num_ellipsis = rep(0,len)
reviews$num_bracketsPair = rep(0,len)
reviews$num_asterisk = rep(0,len)
reviews$num_colon = rep(0,len)
reviews$num_dollarSign = rep(0,len)
smileyPattern = "(:|=|8|x|X|B){1}(-|o|c|\\^|)?(\\)|\\]|\\>|\\}|3|D){1}"
urlPattern = "[.]((com)|(edu)|(gov)|(int)|(mil)|(net)|(org)|(biz)|(info)|(pro)|(name)|(museum)|(coop)|(aero)|(xxx)|(idv))"
temp_text = as.character(reviews$text)
# Num of words
reviews$num_words = sapply(gregexpr("\\b\\W+\\b", temp_text, perl=TRUE), function(x) sum(x>0))+1
# Num of paragraphs
reviews$num_paragraph = sapply(gregexpr("\n+", temp_text, perl=TRUE), function(x) sum(x>0))
# Num of smileys
reviews$num_smiley = sapply(gregexpr(smileyPattern, temp_text, perl=TRUE), function(x) sum(x>0))
# Num of urls
reviews$num_url = sapply(gregexpr(urlPattern, temp_text, perl=TRUE), function(x) sum(x>0))
# Num of nums
reviews$num_nums = sapply(gregexpr("[0-9]+[.]?[0-9]*", temp_text, perl=TRUE), function(x) sum(x>0))
# Num of capitals
reviews$num_capital = sapply(gregexpr("[[:upper:]]{2,}", temp_text, perl=TRUE), function(x) sum(x>0))
# Num of exclamation
reviews$num_exclamation = sapply(gregexpr("!", temp_text, perl=TRUE), function(x) sum(x>0))
# Num of quotationPair
reviews$num_quotationPair = sapply(gregexpr("(\"+?.+?\"+?)|(\\W+\'+?.+?\'+?)", temp_text, perl=TRUE),function(x) sum(x>0))
# Num of ellipsis
reviews$num_ellipsis = sapply(gregexpr("[.]{2,}", temp_text, perl=TRUE),function(x) sum(x>0))
# Num of bracketsPair
reviews$num_bracketsPair = sapply(gregexpr("([(]+?.*?[)]+?)|([[]+?.*?[]]+?)|([{]+?.*?[}]+?)|([<]+?.*?[>]+?)",
temp_text, perl=TRUE), function(x) sum(x>0))
# Num of asterisk
#Might not be useful since used to seprate lines
reviews$num_asterisk = sapply(gregexpr("[*]", temp_text, perl=TRUE), function(x) sum(x>0))
# Num of colon
reviews$num_colon = sapply(gregexpr(":", temp_text, perl=TRUE), function(x) sum(x>0))
# Num of dollarSign
reviews$num_dollarSign = sapply(gregexpr("[$]",temp_text,perl=TRUE), function(x) sum(x>0))
##################################################################
# handling business data
##################################################################
if (exists("categoryIndex")){
rm(categoryIndex)
}
BigCategories<- c("Restaurants","Nightlife","Shopping","Food","Health & Medical","Beauty & Spas",
"Home Services","Local Services","Event Planning & Services","Arts & Entertainment",
"Active Life","Professional Services","Automotive","Hotels & Travel","Education","Real Estate",
"Pets","Financial Services","Local Flavor","Public Services & Government","Mass Media","Religious Organization")
#categoryIndex<-data.frame()
for(category in BigCategories){
categoryname<-gsub(" ","",category)
categoryname<-gsub("&","",categoryname)
categoryname<-sprintf("%s_index",categoryname)
if (!exists("categoryIndex")){
Express<- sprintf("%s<-grepl(category,business$categories,ignore.case = TRUE)
categoryIndex<-data.frame(%s)",categoryname,categoryname)
}
else{
Express<- sprintf("categoryIndex$%s <- grepl(category,business$categories,ignore.case = TRUE)",categoryname)
}
eval(parse(text = Express))
}
categoryIndexNames = names(categoryIndex)
business = cbind(business,categoryIndex)
business2<-ddply(reviews,.(business_id),summarise,
b.avg_stars = mean(stars),
b.avg_useful = mean(votes_useful),
b.avg_cool = mean(votes_cool),
b.avg_funny = mean(votes_funny),
b.median_useful = median(votes_useful),
b.median_cool = median(votes_cool),
b.median_funny = median(votes_funny),
b.median_stars = median(stars),
b.sd_useful = sqrt(var(votes_useful)),
b.sd_cool = sqrt(var(votes_cool)),
b.sd_funny = sqrt(var(votes_funny)),
b.sd_stars = sqrt(var(stars)),
b.min_useful = min(votes_useful),
b.min_cool = min(votes_cool),
b.min_funny = min(votes_funny),
b.min_stars = min(stars),
b.max_useful = max(votes_useful),
b.max_cool = max(votes_cool),
b.max_funny = max(votes_funny),
b.max_stars = max(stars)
)
businessFeatureNames = names(business2[,-1])
business = merge(business,business2,by = "business_id",all.x = TRUE)
# reviews of the business / #check-ins (to get a coefficient to weight # of visits)
### #check-ins --- num_checkin, review_count
## business$ avg_reviews_checkin
checkin$num_checkin<- rowSums(checkin[,-1])
business<-merge(business,checkin[,c("business_id","num_checkin")],by = "business_id",all.x = TRUE)
business$avg_reviews_checkin = business$review_count/business$num_checkin
# find difference between user rating & business average rating
# Then append business features: categories,num_checkin,avg_reviews_checkin and diff_urate_brate to review
# Difference between user rating & business average rating
## reviews$ diff_urate_brate
old_names = colnames(business)
new_names = old_names
new_names[which(new_names=="stars")]="b.stars"
names(business) = new_names
reviews<-merge(reviews,business[,c("business_id",categoryIndexNames,"b.stars","num_checkin","avg_reviews_checkin",businessFeatureNames)],
by = "business_id",all.x = TRUE)
reviews$diff_urate_brate = reviews$stars-reviews$b.stars
colnames(business) = old_names
##################################################################
# handling users data
##################################################################
# No. of distinct categories that user has reviewed --- need merge of reviews and business
# No. of distinct businesses the user has reviewed average / median / sd / min / max of useful/cool/funny votes
Expr = parse(text = sprintf("cbind(%s)",paste(categoryIndexNames,collapse = ",")))
users2<-ddply(reviews,.(user_id),summarise,
u.num_category = sum(colSums(eval(Expr))>0),
u.num_business = length(unique(business_id)),
u.median_useful = median(votes_useful),
u.median_cool = median(votes_cool),
u.median_funny = median(votes_funny),
u.median_stars = median(stars),
u.sd_useful = sqrt(var(votes_useful)),
u.sd_cool = sqrt(var(votes_cool)),
u.sd_funny = sqrt(var(votes_funny)),
u.sd_stars = sqrt(var(stars)),
u.min_useful = min(votes_useful),
u.min_cool = min(votes_cool),
u.min_funny = min(votes_funny),
u.min_stars = min(stars),
u.max_useful = max(votes_useful),
u.max_cool = max(votes_cool),
u.max_funny = max(votes_funny),
u.max_stars = max(stars)
)
# user total useful votes / #reviews (the average useful votes per review; same for cool & funny)
## users$ avg_useful, avg_funny, avg_cool
# note: compute these directly on the users table (rows of users2 come from ddply
# and need not align with users), then merge the ddply features in by user_id
users$u.avg_useful = users$votes_useful/users$review_count
users$u.avg_funny = users$votes_funny/users$review_count
users$u.avg_cool = users$votes_cool/users$review_count
users = merge(users,users2,by = "user_id",all.x = TRUE)
userFeatureNames = c(names(users2),"u.avg_useful","u.avg_funny","u.avg_cool")
# append user avg_useful, avg_funny, avg_cool to the review data
reviews = merge(reviews,users[,c(userFeatureNames)],by = "user_id", all.x = TRUE)
from __future__ import division  # must precede all other statements
import nltk
import string
import os
import csv
import re
import numpy as np
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import spectral_clustering
from sklearn.cluster import KMeans
from scipy import sparse
# import pyamg
from sklearn.decomposition import RandomizedPCA
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.probability import FreqDist
from nltk.tokenize.regexp import RegexpTokenizer
import mpl_toolkits.mplot3d.axes3d as p3
import pylab as pl
%matplotlib inline
# %pylab inline
import numpy.ma as ma
from mayavi import mlab
import networkx as nx
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import linear_kernel
stemmer = PorterStemmer()
def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems
numInText = re.compile(r'[0-9]')
reviewFilePath = 'reviewsShopping1.csv'
reviews = {}
usefulvotes = {}
csvfile = open(reviewFilePath,'rb')
spamreader = csv.reader(csvfile,delimiter =',',quotechar='"')
i = -1
for row in spamreader:
    if i == -1:
        # first row is the header
        i = i + 1
        head = row
    else:
        text = row[head.index('text')]
        lowers = text.lower()
        no_num = numInText.sub('', lowers)
        no_punctuation = no_num.translate(None, string.punctuation)
        usefulvotes[row[head.index('review_id')]] = int(row[head.index('votes_useful')])
        reviews[row[head.index('review_id')]] = no_punctuation
csvfile.close()
#### Extract features
tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
tfs = tfidf.fit_transform(reviews.values())
pca_dim_reduction = RandomizedPCA(n_components = 200)
Low_dim_Component = pca_dim_reduction.fit_transform(tfs)
# affinity = cosine_similarity(tfs, tfs)
n_clusters = 10
# s_affinity = sparse.csr_matrix(affinity)
# labels_pred = spectral_clustering(s_affinity, k=n_clusters, mode='amg')
clusteringMethod = KMeans(init='k-means++', n_clusters=n_clusters, n_init=10)
clusteringMethod.fit(tfs)
labels_pred = clusteringMethod.predict(tfs)
## Find number of adjs
tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize('Eighty-seven miles to go, yet. Onward!')
num_adjs = {}
i = 0
for review_id, text in reviews.iteritems():
    tagged_text = nltk.pos_tag(tokenizer.tokenize(text))
    num_adjs[review_id] = len([word
                               for (word, tag) in tagged_text
                               if tag in ['JJ', 'JJR', 'JJS']])
    i += 1
    # progress indicator every 1000 reviews
    if i % 1000 == 0:
        print i
with open('textFeaturesShopping.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',',
                            quotechar='"', quoting=csv.QUOTE_MINIMAL)
    spamwriter.writerow(['review_id', 'num_adjs', 'cluster_label'])
    for i in range(len(reviews)):
        id = reviews.keys()[i]
        spamwriter.writerow([id, str(num_adjs[id]), str(labels_pred[i])])
usefulLevel = {}
for id, vote in usefulvotes.iteritems():
    if vote == 0:
        usefulLevel[id] = 0
    elif vote > 5:
        usefulLevel[id] = 2
    else:
        usefulLevel[id] = 1
# down sample
import random
review_d_ids = random.sample(reviews.keys(),10000)
reviews_d = {}
usefulLevel_d = {}
for id in review_d_ids:
    reviews_d[id] = reviews[id]
    usefulLevel_d[id] = usefulLevel[id]
tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
tfs = tfidf.fit_transform(reviews_d.values())
pca_dim_reduction = RandomizedPCA(n_components = 200)
Low_dim_Component = pca_dim_reduction.fit_transform(tfs)
n_clusters = 10
clusteringMethod = KMeans(init='k-means++', n_clusters=n_clusters, n_init=10)
clusteringMethod.fit(Low_dim_Component)
labels_pred = clusteringMethod.predict(Low_dim_Component)
## Draw result for a down sampled data
pca3 = RandomizedPCA(n_components = 3)
xyz = pca3.fit_transform(tfs)
fig = pl.figure(figsize=(8,8))
ax = p3.Axes3D(fig)
# ax.view_init(7, -80)
for l in np.unique(labels_pred):
    ax.plot3D(xyz[labels_pred == l, 0],
              xyz[labels_pred == l, 1],
              xyz[labels_pred == l, 2],
              'o', color=pl.cm.jet(float(l) / np.max(labels_pred + 1)))
pl.title('Clusters')
pl.show()
title = ["Class 'Low'", "Class 'Medium'", "Class 'High'"]
for id in review_d_ids:
    reviews_d[id] = reviews[id]
    usefulLevel_d[id] = usefulLevel[id]
bins = [0,1,2,3,4,5,6,7,8,9,10]
pl.figure(figsize=(10,9))
for l in range(3):
    pl.subplot(3, 1, l + 1)
    pl.hist(labels_pred[np.array(usefulLevel_d.values()) == l], bins, histtype='bar', rwidth=0.5)
    pl.title(title[l])
pl.show()
From the histogram above, we can see that the proportions of 'Low', 'Medium', and 'High' vary across clusters. For example, cluster 1 is constructive in distinguishing 'High' from 'Medium' and 'Low'.
RGBcolors = [[1,0.2,0.2],
[1,0.6,0.2],[1,1,0.2],
[0.6,1,0.2],[0.2,1,0.2],
[0.2,1,0.6],[0.2,1,1],
[0.2,0.6,1],[0.2,0.2,1],
[0,0,1]
]
usefulLevel = {}
for id, vote in usefulvotes.iteritems():
    if vote > 9:
        usefulLevel[id] = RGBcolors[9]
    else:
        usefulLevel[id] = RGBcolors[vote]
# down sample
reviews_d = {}
usefulLevel_d = {}
for id in review_d_ids:
    reviews_d[id] = reviews[id]
    usefulLevel_d[id] = usefulLevel[id]
g = nx.Graph()
forth_cos_sim = []
edge_list = set()
threshold = 0.5
num_docs = len(reviews_d)
for i in range(num_docs):
    cosine_similarities = linear_kernel(tfs[i:i+1], tfs).flatten()
    # related_docs_indices = cosine_similarities.argsort()[:-5:-1]
    related_docs_indices = [ind for ind in range(num_docs)
                            if cosine_similarities[ind] > threshold]
    # forth_cos_sim.append(cosine_similarities[related_docs_indices[-1]])
    # g.add_node(i, color = usefulLevel_d.values()[i])
    g.add_edges_from([(i, j, {'weight': cosine_similarities[j]})
                      for j in related_docs_indices if j > i])
    edge_list.update([(i, j)
                      for j in related_docs_indices if j > i])
from sklearn.decomposition import RandomizedPCA
pca3 = RandomizedPCA(n_components = 3)
positions = pca3.fit_transform(tfs)
pos01={}
pos02={}
pos12={}
for node in g.nodes():
    pos01[node] = positions[node, 0:2]
    pos02[node] = [positions[node, 0], positions[node, 2]]
    pos12[node] = positions[node, 1:3]
# draw 2-D plots
from matplotlib import gridspec
plt.figure(figsize=(5,15))
gs = gridspec.GridSpec(3, 1, height_ratios=[1, 1,1])
plt.subplot(gs[0])
nx.draw(g,pos01, node_color=[usefulLevel_d.values()[v] for v in g.nodes()],
node_size=50,alpha = 0.8,
with_labels=None)
plt.subplot(gs[1])
nx.draw(g,pos02, node_color=[usefulLevel_d.values()[v] for v in g.nodes()],
node_size=50,alpha = 0.8,
with_labels=None)
plt.subplot(gs[2])
nx.draw(g,pos12, node_color=[usefulLevel_d.values()[v] for v in g.nodes()],
node_size=50,alpha = 0.8,
with_labels=None)
plt.show()
We plotted the network from different 2-D perspectives to observe its connectivity, but it is hard to see it clearly due to the perspective limitations.
We then plotted a 3D network to observe the connections between documents. Red nodes are 'Low', green nodes are 'Medium', and blue nodes are 'High'. As we can see, the less useful text tends to spread out in the dimension-reduced 'word' space and has lower connectivity to other nodes.
from IPython.core.display import Image
Image('Network3D.png')
We looked into the features with parallel coordinates plots, and we can observe that features from users and businesses distinguish the three classes to some extent.
%%R
reviewsShopping <- read.csv(file = "/Users/shali/Documents/CS249/Midterm/reviewsShopping1.csv")
reviewsShopping[,c(12:24,47:91)] = na.roughfix(reviewsShopping[,c(12:24,47:91)])
%%R -w 1100
library(MASS)
reviewsShopping$class = ifelse(reviewsShopping$votes_useful==0,0,ifelse(reviewsShopping$votes_useful>5,2,1))
featurenames1 = colnames(reviewsShopping)[c(12:24,48,49,70,92)]
reviewfeatures1 = reviewsShopping[,featurenames1]
parcoord(reviewfeatures1, col=reviewfeatures1$class+2, var.label=TRUE)
%%R -w 1100
featurenames2 = colnames(reviewsShopping)[c(47,50:69,92)]
#featurenames4 = colnames(reviews)[77:92]
reviewfeatures2 = reviewsShopping[,featurenames2]
#reviewfeatures4 = reviews[,featurenames4]
parcoord(reviewfeatures2, col=reviewfeatures2$class+2, var.label=TRUE)
%%R -w 1100
featurenames3 = colnames(reviewsShopping)[c(71:92)]
reviewfeatures3 = reviewsShopping[,featurenames3]
parcoord(reviewfeatures3, col=reviewfeatures3$class+2, var.label=TRUE)
From the parallel coordinate plots above we can identify some informative features, e.g. u.median_cool, on which the data are grouped according to their classes.
We first tried a one-vs-all multiclass classification strategy with gbm, which achieved about 74.71% accuracy. We then tried multiclass classification with random forest, reaching 80.07%. The accuracies for direct multiclass classification were not as good as we expected, so we decided to try two-layer binary classification to see whether it brings any improvement.
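The two-layer scheme can be sketched as follows; the two binary classifiers here are toy stand-in callables (a review is represented only by its vote count for illustration), not our trained models.

```python
# Sketch of two-layer classification: layer 1 separates zero-vote ("Low")
# reviews; layer 2 splits the remaining reviews into "Medium" vs. "High".
def two_layer_predict(review, layer1, layer2):
    if layer1(review) == 0:        # layer 1 predicts zero useful votes
        return "Low"
    return "High" if layer2(review) == 1 else "Medium"

# Toy stand-in classifiers that just threshold the true vote count.
layer1 = lambda votes: 1 if votes > 0 else 0
layer2 = lambda votes: 1 if votes >= 5 else 0
labels = [two_layer_predict(v, layer1, layer2) for v in (0, 3, 9)]
```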
We first use the 59 non-text features for classification. Then we add the text features (number of adjectives in the review text and the cluster index) and classify again.
Gradient boosting is a machine learning technique for regression problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function. The gradient boosting method can also be used for classification problems by reducing them to regression with a suitable loss function. In this project, the 'gbm' package in R is used.
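A minimal sketch of the stage-wise idea described above, using plain squared-error boosting with sklearn trees (this is not the R gbm package; all names are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gb_fit(X, y, n_trees=100, depth=3, shrinkage=0.1):
    """Stage-wise gradient boosting for squared error: each tree fits the
    current residuals (the negative gradient) and is added with shrinkage."""
    f0 = y.mean()                          # initial constant model
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        t = DecisionTreeRegressor(max_depth=depth).fit(X, y - pred)
        pred += shrinkage * t.predict(X)
        trees.append(t)
    return f0, trees

def gb_predict(f0, trees, X, shrinkage=0.1):
    pred = np.full(X.shape[0], f0)
    for t in trees:
        pred += shrinkage * t.predict(X)
    return pred

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = 2 * X[:, 0] + np.sin(3 * X[:, 1]) + 0.1 * rng.randn(200)
f0, trees = gb_fit(X, y)
mse = np.mean((gb_predict(f0, trees, X) - y) ** 2)
```

The training MSE drops well below the raw variance of `y`, which is exactly the loss reduction each added tree contributes.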
Missing values in the dataset are replaced by the column median, and the features are normalized to get ready for regression.
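The same preprocessing can be sketched in Python (the column names here are invented for illustration; the R code does this with na.roughfix and x/max(x)):

```python
import pandas as pd

# Toy feature frame with missing entries.
df = pd.DataFrame({"num_words": [120, None, 80, 400],
                   "stars":     [4.0, 5.0, None, 3.0]})
df = df.fillna(df.median())   # replace missing values with the column median
df = df / df.max()            # normalize each feature by its maximum
```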
%%R
trainShopping = read.csv("/Users/shali/Documents/CS249/Midterm/trainShopping.csv")
testShopping = read.csv("/Users/shali/Documents/CS249/Midterm/testShopping.csv")
cols = c(1,4:dim(trainShopping)[2])
train = trainShopping[,cols]
#Remove NAs
train[is.na(train)] = 0
#Normalize the data
#train = apply(train,2,function(x) x/max(x))
train = data.frame(train)
#print(dim(train))
#print(colnames(train))
test = testShopping[,cols]
#Remove NAs
test[is.na(test)] = 0
#Normalize the data
#test = apply(test,2,function(x) x/max(x))
test = data.frame(test)
#print(dim(test))
#print(colnames(test))
#Use two layer classification
#First label 1 or 0 by 'zero' votes
#rm = "b.min_funny"
#First regression
train.bernoulli = train
train.bernoulli$high = ifelse(train.bernoulli$votes_useful >= 5, 1, 0)
train.bernoulli = train.bernoulli[,c(dim(train.bernoulli)[2],1,3:(dim(train.bernoulli)[2]-1))]
train.bernoulli = apply(train.bernoulli,2,function(x) x/max(x))
train.bernoulli = data.frame(train.bernoulli)
#train.bernoulli = train.bernoulli[,!colnames(train.bernoulli)%in% rm]
test.bernoulli = test
test.bernoulli$high = ifelse(test.bernoulli$votes_useful >= 5, 1, 0)
test.bernoulli = test.bernoulli[,c(dim(test.bernoulli)[2],1,3:(dim(test.bernoulli)[2]-1))]
test.bernoulli = apply(test.bernoulli,2,function(x) x/max(x))
test.bernoulli = data.frame(test.bernoulli)
#test.bernoulli = test.bernoulli[,!colnames(test.bernoulli)%in% rm]
#print(colnames(train.bernoulli))
#cat("\n")
#print(colnames(test.bernoulli))
gbm.high = gbm(high~.,
distribution = "bernoulli", data = train.bernoulli, var.monotone = NULL,
n.trees = 2000,interaction.depth = 3,n.minobsinnode = 10,shrinkage = 0.01,bag.fraction = 0.5,
train.fraction = 1.0,cv.folds=5,keep.data = TRUE,verbose = "CV",class.stratify.cv=NULL,
n.cores = NULL)
gbm.high.bestIter = gbm.perf(gbm.high, method="cv")
#print(gbm.high.bestIter)
#Prediction test error
gbm.high.predict_test = predict(gbm.high, test.bernoulli, gbm.high.bestIter,type="response",single.tree=FALSE)
#Second regression
train.bernoulli = train
train.bernoulli$medium = ifelse((train.bernoulli$votes_useful <= 5 &
train.bernoulli$votes_useful >0), 1, 0)
train.bernoulli = train.bernoulli[,c(dim(train.bernoulli)[2],1,3:(dim(train.bernoulli)[2]-1))]
train.bernoulli = apply(train.bernoulli,2,function(x) x/max(x))
train.bernoulli = data.frame(train.bernoulli)
#train.bernoulli = train.bernoulli[,!colnames(train.bernoulli)%in% rm]
test.bernoulli = test
test.bernoulli$medium = ifelse((test.bernoulli$votes_useful <= 5 &
test.bernoulli$votes_useful >0), 1, 0)
test.bernoulli = test.bernoulli[,c(dim(test.bernoulli)[2],1,3:(dim(test.bernoulli)[2]-1))]
test.bernoulli = apply(test.bernoulli,2,function(x) x/max(x))
test.bernoulli = data.frame(test.bernoulli)
#test.bernoulli = test.bernoulli[,!colnames(test.bernoulli)%in% rm]
#print(colnames(train.bernoulli))
#cat("\n")
#print(colnames(test.bernoulli))
gbm.med = gbm(medium~.,
distribution = "bernoulli", data = train.bernoulli, var.monotone = NULL,
n.trees = 2000,interaction.depth = 3,n.minobsinnode = 10,shrinkage = 0.01,bag.fraction = 0.5,
train.fraction = 1.0,cv.folds=5,keep.data = TRUE,verbose = "CV",class.stratify.cv=NULL,
n.cores = NULL)
gbm.med.bestIter = gbm.perf(gbm.med, method="cv")
#print(gbm.med.bestIter)
#Prediction, test error
gbm.medium.predict_test = predict(gbm.med, test.bernoulli, gbm.med.bestIter,type="response",single.tree=FALSE)
#Third regression
train.bernoulli = train
train.bernoulli$low = ifelse(train.bernoulli$votes_useful <= 0 , 1, 0)
train.bernoulli = train.bernoulli[,c(dim(train.bernoulli)[2],1,3:(dim(train.bernoulli)[2]-1))]
train.bernoulli = apply(train.bernoulli,2,function(x) x/max(x))
train.bernoulli = data.frame(train.bernoulli)
#train.bernoulli = train.bernoulli[,!colnames(train.bernoulli)%in% rm]
test.bernoulli = test
test.bernoulli$low = ifelse(test.bernoulli$votes_useful <= 0, 1, 0)
test.bernoulli = test.bernoulli[,c(dim(test.bernoulli)[2],1,3:(dim(test.bernoulli)[2]-1))]
test.bernoulli = apply(test.bernoulli,2,function(x) x/max(x))
test.bernoulli = data.frame(test.bernoulli)
#test.bernoulli = test.bernoulli[,!colnames(test.bernoulli)%in% rm]
#print(colnames(train.bernoulli))
#cat("\n")
#print(colnames(test.bernoulli))
gbm.low = gbm(low~.,
distribution = "bernoulli", data = train.bernoulli, var.monotone = NULL,
n.trees = 2000,interaction.depth = 3,n.minobsinnode = 10,shrinkage = 0.01,bag.fraction = 0.5,
train.fraction = 1.0,cv.folds=5,keep.data = TRUE,verbose = "CV",class.stratify.cv=NULL,
n.cores = NULL)
gbm.low.bestIter = gbm.perf(gbm.low, method="cv")
#print(gbm.low.bestIter)
#Prediction, test error
gbm.low.predict_test = predict(gbm.low, test.bernoulli, gbm.low.bestIter,type="response",single.tree=FALSE)
#Vote for the class
predict_test = matrix(data = c(gbm.low.predict_test,gbm.medium.predict_test,
gbm.high.predict_test), nrow = length(gbm.high.predict_test), ncol = 3, byrow = FALSE,
dimnames = NULL)
#print(dim(predict_test))
predict_test = max.col(predict_test, ties.method = c("random"))
#print(length(predict_test))
#print(unique(predict_test))
quality_test = ifelse(test$votes_useful >= 5, 3, ifelse(test$votes_useful > 0, 2, 1))
cat("The overall accuracy on testing dataset is ", mean(quality_test == predict_test))
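The voting step at the end (R's max.col over the three one-vs-rest probabilities) is a row-wise argmax; a quick Python sketch with made-up scores:

```python
import numpy as np

# Columns: P(low), P(medium), P(high) from the three bernoulli gbms.
scores = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6],
                   [0.2, 0.5, 0.3]])
# argmax picks the first column on ties, whereas R's max.col(ties="random")
# breaks ties randomly; with continuous probabilities this rarely matters.
predicted = scores.argmax(axis=1) + 1   # 1 = low, 2 = medium, 3 = high
```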
%%R
#Save the gbm object for later use
save(gbm.high, file = "gbm.high.RData")
save(gbm.med, file = "gbm.med.RData")
save(gbm.low, file = "gbm.low.RData")
from IPython.display import Image
Image(filename='/Users/shali/Documents/CS249/Midterm/CS249_Project/Twolayer.jpg',
embed=True)
%%R
#Load gbm library
#Use Shopping reviews as an example
library(gbm)
library(randomForest)
%%R
#trainShopping, testShopping data 10:1
trainShopping = read.csv(file = "/Users/shali/Documents/CS249/Midterm/CS249_Project/shopping_train.csv")
testShopping = read.csv(file = "/Users/shali/Documents/CS249/Midterm/CS249_Project/shopping_test.csv")
Extract the useful columns of the dataset.
%%R
cols = c(5,8:10,12:24,47:91)
trainShopping = na.roughfix(trainShopping[,cols])
testShopping = na.roughfix(testShopping[,cols])
cols = c(1,4:dim(trainShopping)[2])
train = trainShopping[,cols]
train = data.frame(train)
cat("The training dataset dimension is: \n")
print(dim(train))
cat("The training dataset colnames: \n")
print(colnames(train))
test = testShopping[,cols]
test = data.frame(test)
cat("The testing dataset dimension is: \n")
print(dim(test))
cat("The testing dataset colnames: \n")
print(colnames(test))
Two-layer classification is used below, with a Bernoulli (logistic) gbm at each layer.
%%R
#Use two layer classification, Bernoulli regression is used
#First layer classification
#First label 1 or 0 by 'zero' votes
rm = ""
train.bernoulli = train
train.bernoulli$quality1 = ifelse(train.bernoulli$votes_useful > 0, 1, 0)
train.bernoulli = train.bernoulli[,c(dim(train.bernoulli)[2],1,3:(dim(train.bernoulli)[2]-1))]
train.bernoulli = apply(train.bernoulli,2,function(x) x/max(x))
train.bernoulli = data.frame(train.bernoulli)
train.bernoulli = train.bernoulli[,!colnames(train.bernoulli)%in% rm]
test.bernoulli = test
test.bernoulli$quality1 = ifelse(test.bernoulli$votes_useful > 0, 1, 0)
test.bernoulli = test.bernoulli[,c(dim(test.bernoulli)[2],1,3:(dim(test.bernoulli)[2]-1))]
test.bernoulli = apply(test.bernoulli,2,function(x) x/max(x))
test.bernoulli = data.frame(test.bernoulli)
test.bernoulli = test.bernoulli[,!colnames(test.bernoulli)%in% rm]
#print(colnames(train.bernoulli))
#cat("\n")
#print(colnames(test.bernoulli))
#First layer training
gbm2 = gbm(quality1~.,
distribution = "bernoulli", data = train.bernoulli, var.monotone = NULL,
n.trees = 2000,interaction.depth = 3,n.minobsinnode = 10,shrinkage = 0.01,bag.fraction = 0.5,
train.fraction = 1.0,cv.folds=5,keep.data = TRUE,verbose = "CV",class.stratify.cv=NULL,
n.cores = NULL)
gbm2.bestIter = gbm.perf(gbm2, method="cv")
#print(gbm2.bestIter)
#Prediction, train error
gbm2.predict_train = predict(gbm2, train.bernoulli, gbm2.bestIter,type="response",single.tree=FALSE)
gbm2.qualityPredict_train = ifelse(gbm2.predict_train > 0.5, 1, 0)
cat("The accuracy for training set on first layer is ",
mean(gbm2.qualityPredict_train == train.bernoulli$quality1),"\n")
#Prediction, test error
gbm2.predict_test = predict(gbm2, test.bernoulli, gbm2.bestIter,type="response",single.tree=FALSE)
#print(max(gbm2.predict_test))
#print(min(gbm2.predict_test))
gbm2.qualityPredict_test = ifelse(gbm2.predict_test > 0.5, 1, 0)
cat("The accuracy for testing set on first layer is ",
mean(gbm2.qualityPredict_test == test.bernoulli$quality1),"\n")
#Second Classification
rm = ""
train.bernoulli$votes_useful = train$votes_useful # Include the votes_useful column
#train set contain only high medium review
train.bernoulli.hm = train.bernoulli[train.bernoulli$quality1 == 1,]
#Relabel 1 or 0 for reviews of high or medium quality
train.bernoulli.hm$quality2 = ifelse(train.bernoulli.hm$votes_useful >=5, 1,0)
m = dim(train.bernoulli.hm)[2]
train.bernoulli.hm = train.bernoulli.hm[,c(m,2:(m-2))] #Reorder the data frame
train.bernoulli.hm = train.bernoulli.hm[,!colnames(train.bernoulli.hm) %in% rm]
#print(colnames(train.bernoulli.hm))
#cat("\n")
#Do the same for the test set
test.bernoulli$votes_useful = test$votes_useful
#test set contain only high medium review from the prediction
test.bernoulli.hm = test.bernoulli[gbm2.qualityPredict_test == 1,]
test.bernoulli.hm$quality2 = ifelse(test.bernoulli.hm$votes_useful >=5, 1,0)
m = dim(test.bernoulli.hm)[2]
test.bernoulli.hm = test.bernoulli.hm[,c(m,2:(m-2))]
test.bernoulli.hm = test.bernoulli.hm[,!colnames(test.bernoulli.hm) %in% rm]
#print(colnames(test.bernoulli.hm))
#Second Train
gbm3 = gbm(quality2 ~.,
distribution = "bernoulli", data = train.bernoulli.hm, var.monotone = NULL,
n.trees = 2000,interaction.depth = 3,n.minobsinnode = 10,shrinkage = 0.01,bag.fraction = 0.5,
train.fraction = 1.0,cv.folds=5,keep.data = TRUE,verbose = "CV",class.stratify.cv=NULL,
n.cores = NULL)
gbm3.bestIter = gbm.perf(gbm3, method="cv")
#print(gbm3.bestIter)
#Prediction, train error
gbm3.predict_train = predict(gbm3, train.bernoulli.hm, gbm3.bestIter,type="response",single.tree=FALSE)
gbm3.qualityPredict_train = ifelse(gbm3.predict_train > 0.5, 1, 0)
cat("The accuracy for training set on second layer is ",
mean(gbm3.qualityPredict_train == train.bernoulli.hm$quality2),"\n")
#print(dim(train.bernoulli.hm)[1])
#print(length(train.bernoulli.hm$quality2[train.bernoulli.hm$quality2 == 1]))
#Combine the high, medium, low and calculate the train accuracy
index0 = which(gbm2.qualityPredict_train == 0)
#gbm3.qualityPredictRelabel_train = ifelse(gbm3.qualityPredict_train == 1, 3, 2)
qualityPredict.train.bernoulli = c(gbm3.qualityPredict_train,
gbm2.qualityPredict_train[index0])
#train.bernoulli.hm.qualityRelabel = ifelse(train.bernoulli.hm$quality2 == 1, 3, 2)
trainQuality.bernoulli = c(train.bernoulli.hm$quality2,
train.bernoulli$quality1[index0] )
cat("The overall accuracy for training set is ",
mean(qualityPredict.train.bernoulli == trainQuality.bernoulli),"\n")
# Prediction, test error
gbm3.predict_test = predict(gbm3, test.bernoulli.hm, gbm3.bestIter,type="response",single.tree=FALSE)
gbm3.qualityPredict_test = ifelse(gbm3.predict_test >0.5, 1, 0)
cat("The accuracy for testing set on second layer is ",
mean(gbm3.qualityPredict_test == test.bernoulli.hm$quality2),"\n")
#print(dim(test.bernoulli.hm)[1])
#print(length(test.bernoulli.hm$quality2[test.bernoulli.hm$quality2 == 1]))
#Combine the high, medium, low and calculate the test accuracy
index0 = which(gbm2.qualityPredict_test == 0)
#gbm3.qualityPredictRelabel_test = ifelse(gbm3.qualityPredict_test == 1, 3, 2)
qualityPredict.test.bernoulli = c(gbm3.qualityPredict_test,
gbm2.qualityPredict_test[index0])
#test.bernoulli.hm.qualityRelabel = ifelse(test.bernoulli.hm$quality2 == 1, 3, 2)
testQuality.bernoulli = c(test.bernoulli.hm$quality2, test.bernoulli$quality1[index0])
#Print the accuracy
cat("The overall accuracy on testing data is ",
mean(qualityPredict.test.bernoulli == testQuality.bernoulli), "\n")
We ran the code above and saved the R objects to save time in the analysis. Here we simply load the trained gbm models and show the plots.
%%R
load("/Users/shali/Documents/CS249/Midterm/gbm2.RData")
load("/Users/shali/Documents/CS249/Midterm/gbm3.RData")
load("/Users/shali/Documents/CS249/Midterm/gbm2.bestIter.RData")
load("/Users/shali/Documents/CS249/Midterm/gbm3.bestIter.RData")
First we run a gbm summary for the first-layer model to calculate the relative importance of its features.
%%R
summary(gbm2, n.trees = gbm2.bestIter)
Next we run a gbm summary for the second-layer model to calculate the relative importance of its features.
%%R
summary(gbm3, n.trees = gbm3.bestIter)
We select the features with relative importance greater than one and plot pie charts to make the result clearer.
%%R -w 900 -h 600
#Pie plot of feature relative importance
par(mfrow = c(1,2))
slices <- c(55.62,9.23,6.81,5.11,4.13,3.04,2.78,2.59,1.69,1.22)
lbls <- c("u.median_useful","b.median_useful","u.max_useful",
"b.avg_useful","u.min_useful","u.median_cool",
"num_words","b.min_useful","u.sd_useful","stars")
pie(slices, labels = lbls, main="Feature Relative Importance > 1(First Layer)")
slices1 <- c(37.98,15.50,12.19,5.00,3.12,2.55,2.46,2.15,2.08,1.88,1.33,1.02)
lbls1 <- c("u.median_useful","b.max_useful","u.sd_useful",
"b.sd_useful","u.max_useful","u.median_cool",
"num_words","b.avg_useful","u.sd_cool","num_paragraph",
"u.sd_funny","u.median_funny")
pie(slices1, labels = lbls1, main="Feature Relative Importance > 1(Second Layer)")
Most of the key features are user statistics, e.g., u.median_useful (median of the user's useful votes) and b.median_useful (median of business-level useful votes). This means that if a user is reliable (perhaps a Yelp Elite member) and often writes good reviews, a new review by that user is more likely to be useful. On the business level, if the business is of high quality, its reviews tend to receive useful votes as well.
As the result indicates, some features are not as useful as others, but every non-zero-importance feature still contributes to loss reduction and accuracy. Thus it is not necessary to remove features in the gbm case; the accuracy remains almost the same.
Here the Gaussian distribution is used to model the loss function: we use votes_useful as the response instead of a 0/1 label, do the regression, and afterwards measure the accuracy using the labels.
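The post-regression labeling uses the same thresholds as before; a small sketch:

```python
def quality_label(predicted_votes, high_cut=5):
    # >= 5 predicted useful votes -> 3 (High), > 0 -> 2 (Medium), else 1 (Low)
    if predicted_votes >= high_cut:
        return 3
    return 2 if predicted_votes > 0 else 1

labels = [quality_label(v) for v in (7.2, 2.0, 0.0, -0.3)]
```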
%%R
#Use gaussian model, first do regression and then calculate accuracy by label reviews
train.gaussian = train[,c(2,1,3:(dim(train)[2]-1))]
test.gaussian = test[,c(2,1,3:(dim(test)[2]-1))]
m = dim(train.gaussian)[2]
votes_temp = train.gaussian$votes_useful
train.gaussian = data.frame(apply(train.gaussian[,c(2:m)], 2, function(x) x/max(x)))
train.gaussian$votes_useful = votes_temp
train.gaussian = train.gaussian[,c(m,1:(m-1))]
cat("The colnames in training set: \n")
print(colnames(train.gaussian))
cat("The maximum votes_useful in training set is: \n")
print(max(train.gaussian$votes_useful))
m = dim(test.gaussian)[2]
votes_temp = test.gaussian$votes_useful
test.gaussian = data.frame(apply(test.gaussian[,c(2:m)], 2, function(x) x/max(x)))
test.gaussian$votes_useful = votes_temp
test.gaussian = test.gaussian[,c(m,1:(m-1))]
cat("The colnames in testing set: \n")
print(colnames(test.gaussian))
cat("The maximum votes_useful in testing set is: \n")
print(max(test.gaussian$votes_useful))
gbm.gaussian = gbm(votes_useful ~ . ,
distribution = "gaussian", data = train.gaussian, var.monotone = NULL,
n.trees = 2000,interaction.depth = 3,n.minobsinnode = 10,shrinkage = 0.01,bag.fraction = 0.5,
train.fraction = 1.0,cv.folds=5,keep.data = TRUE,verbose = "CV",class.stratify.cv=NULL,
n.cores = NULL)
gbm.gaussian.bestIter = gbm.perf(gbm.gaussian,method = "cv")
#print(gbm.gaussian.bestIter)
#Prediction, Train error
gbm.gaussian.predict_train = predict(gbm.gaussian, train.gaussian, gbm.gaussian.bestIter,type="link",single.tree=FALSE)
#print(max(gbm.gaussian.predict_train))
#print(max(train.gaussian$votes_useful))
gbm.gaussian.qualityPredict_train = ifelse(gbm.gaussian.predict_train >= 5, 3,
ifelse(gbm.gaussian.predict_train > 0, 2, 1))
train.gaussian$quality = ifelse(train.gaussian$votes_useful >= 5 ,3, ifelse(train.gaussian$votes_useful > 0, 2, 1))
cat("The accuracy of the training set is ",
mean(gbm.gaussian.qualityPredict_train == train.gaussian$quality),"\n")
#Prediction, test error
gbm.gaussian.predict_test = predict(gbm.gaussian,test.gaussian, gbm.gaussian.bestIter,type="link",single.tree=FALSE)
#print(max(gbm.gaussian.predict_test))
#print(max(test.gaussian$votes_useful))
gbm.gaussian.qualityPredict_test = ifelse(gbm.gaussian.predict_test >= 5, 3, ifelse(gbm.gaussian.predict_test > 0, 2, 1))
test.gaussian$quality = ifelse(test.gaussian$votes_useful >= 5 ,3, ifelse(test.gaussian$votes_useful > 0, 2, 1))
cat("The accuracy of the testing set is ",
mean(gbm.gaussian.qualityPredict_test == test.gaussian$quality), "\n")
The accuracy of doing regression directly is very low. This might be because the numbers of high-, medium- and low-quality reviews are highly imbalanced, so regressing directly on useful votes leads to large error and low accuracy.
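A toy numeric illustration of this imbalance effect (synthetic vote counts, not our data): under squared error, the optimal constant prediction is the mean, which sits far below the 'High' threshold of 5, so high-vote reviews get underpredicted and mislabeled.

```python
import numpy as np

# Synthetic vote counts: mostly zeros, a few highly useful reviews.
votes = np.array([0] * 90 + [1] * 7 + [12, 20, 35])
best_constant = votes.mean()   # squared-error-optimal constant prediction
# best_constant = 0.74: every review would be labeled Medium, never High.
```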
%%R
#trainShopping, testShopping data 10:1
trainShopping = read.csv(file = "/Users/shali/Documents/CS249/Midterm/train.csv")
testShopping = read.csv(file = "/Users/shali/Documents/CS249/Midterm/test.csv")
cols = c(5,8:10,12:24,47:93)
trainShopping = na.roughfix(trainShopping[,cols])
testShopping = na.roughfix(testShopping[,cols])
cols = c(1,4:dim(trainShopping)[2])
train = trainShopping[,cols]
train = data.frame(train)
cat("The training dataset dimension is: \n")
print(dim(train))
cat("The training dataset colnames: \n")
print(colnames(train))
test = testShopping[,cols]
test = data.frame(test)
cat("The testing dataset dimension is: \n")
print(dim(test))
cat("The testing dataset colnames: \n")
print(colnames(test))
The text features are appended as the last two columns.
%%R
#Use two layer classification, Bernoulli regression is used
#First layer classification
#First label 1 or 0 by 'zero' votes
rm = ""
train.bernoulli = train
train.bernoulli$quality1 = ifelse(train.bernoulli$votes_useful > 0, 1, 0)
train.bernoulli = train.bernoulli[,c(dim(train.bernoulli)[2],1,3:(dim(train.bernoulli)[2]-1))]
train.bernoulli = apply(train.bernoulli,2,function(x) x/max(x))
train.bernoulli = data.frame(train.bernoulli)
train.bernoulli = train.bernoulli[,!colnames(train.bernoulli)%in% rm]
test.bernoulli = test
test.bernoulli$quality1 = ifelse(test.bernoulli$votes_useful > 0, 1, 0)
test.bernoulli = test.bernoulli[,c(dim(test.bernoulli)[2],1,3:(dim(test.bernoulli)[2]-1))]
test.bernoulli = apply(test.bernoulli,2,function(x) x/max(x))
test.bernoulli = data.frame(test.bernoulli)
test.bernoulli = test.bernoulli[,!colnames(test.bernoulli)%in% rm]
#First layer training
gbm2 = gbm(quality1~.,
distribution = "bernoulli", data = train.bernoulli, var.monotone = NULL,
n.trees = 2500,interaction.depth = 3,n.minobsinnode = 10,shrinkage = 0.01,bag.fraction = 0.5,
train.fraction = 1.0,cv.folds=5,keep.data = TRUE,verbose = "CV",class.stratify.cv=NULL,
n.cores = NULL)
gbm2.bestIter = gbm.perf(gbm2, method="cv")
#Prediction, train error
gbm2.predict_train = predict(gbm2, train.bernoulli, gbm2.bestIter,type="response",single.tree=FALSE)
gbm2.qualityPredict_train = ifelse(gbm2.predict_train > 0.5, 1, 0)
cat("The accuracy for training set on first layer is ",
mean(gbm2.qualityPredict_train == train.bernoulli$quality1),"\n")
#Prediction, test error
gbm2.predict_test = predict(gbm2, test.bernoulli, gbm2.bestIter,type="response",single.tree=FALSE)
gbm2.qualityPredict_test = ifelse(gbm2.predict_test > 0.5, 1, 0)
cat("The accuracy for testing set on first layer is ",
mean(gbm2.qualityPredict_test == test.bernoulli$quality1),"\n")
#Second Classification
rm = ""
train.bernoulli$votes_useful = train$votes_useful # Include the votes_useful column
#train set contain only high medium review
train.bernoulli.hm = train.bernoulli[train.bernoulli$quality1 == 1,]
#Relabel 1 or 0 for reviews of high or medium quality
train.bernoulli.hm$quality2 = ifelse(train.bernoulli.hm$votes_useful >=5, 1,0)
m = dim(train.bernoulli.hm)[2]
train.bernoulli.hm = train.bernoulli.hm[,c(m,2:(m-2))] #Reorder the data frame
train.bernoulli.hm = train.bernoulli.hm[,!colnames(train.bernoulli.hm) %in% rm]
#Do the same for the test set
test.bernoulli$votes_useful = test$votes_useful
#test set contain only high medium review from the prediction
test.bernoulli.hm = test.bernoulli[gbm2.qualityPredict_test == 1,]
test.bernoulli.hm$quality2 = ifelse(test.bernoulli.hm$votes_useful >=5, 1,0)
m = dim(test.bernoulli.hm)[2]
test.bernoulli.hm = test.bernoulli.hm[,c(m,2:(m-2))]
test.bernoulli.hm = test.bernoulli.hm[,!colnames(test.bernoulli.hm) %in% rm]
#Second Train
gbm3 = gbm(quality2 ~.,
distribution = "bernoulli", data = train.bernoulli.hm, var.monotone = NULL,
n.trees = 2000,interaction.depth = 3,n.minobsinnode = 10,shrinkage = 0.01,bag.fraction = 0.5,
train.fraction = 1.0,cv.folds=5,keep.data = TRUE,verbose = "CV",class.stratify.cv=NULL,
n.cores = NULL)
gbm3.bestIter = gbm.perf(gbm3, method="cv")
#print(gbm3.bestIter)
#Prediction, train error
gbm3.predict_train = predict(gbm3, train.bernoulli.hm, gbm3.bestIter,type="response",single.tree=FALSE)
gbm3.qualityPredict_train = ifelse(gbm3.predict_train > 0.5, 1, 0)
cat("The accuracy for training set on second layer is ",
mean(gbm3.qualityPredict_train == train.bernoulli.hm$quality2),"\n")
#print(dim(train.bernoulli.hm)[1])
#print(length(train.bernoulli.hm$quality2[train.bernoulli.hm$quality2 == 1]))
#Combine the high, medium, low and calculate the train accuracy
index0 = which(gbm2.qualityPredict_train == 0)
#gbm3.qualityPredictRelabel_train = ifelse(gbm3.qualityPredict_train == 1, 3, 2)
qualityPredict.train.bernoulli = c(gbm3.qualityPredict_train,
gbm2.qualityPredict_train[index0])
#train.bernoulli.hm.qualityRelabel = ifelse(train.bernoulli.hm$quality2 == 1, 3, 2)
trainQuality.bernoulli = c(train.bernoulli.hm$quality2,
train.bernoulli$quality1[index0] )
cat("The overall accuracy for training set is ",
mean(qualityPredict.train.bernoulli == trainQuality.bernoulli),"\n")
# Prediction, test error
gbm3.predict_test = predict(gbm3, test.bernoulli.hm, gbm3.bestIter,type="response",single.tree=FALSE)
gbm3.qualityPredict_test = ifelse(gbm3.predict_test >0.5, 1, 0)
cat("The accuracy for testing set on second layer is ",
mean(gbm3.qualityPredict_test == test.bernoulli.hm$quality2),"\n")
#Combine the high, medium, low and calculate the test accuracy
index0 = which(gbm2.qualityPredict_test == 0)
qualityPredict.test.bernoulli = c(gbm3.qualityPredict_test,
gbm2.qualityPredict_test[index0])
#test.bernoulli.hm.qualityRelabel = ifelse(test.bernoulli.hm$quality2 == 1, 3, 2)
testQuality.bernoulli = c(test.bernoulli.hm$quality2, test.bernoulli$quality1[index0])
#Print the accuracy
cat("The overall accuracy on testing data is ",
mean(qualityPredict.test.bernoulli == testQuality.bernoulli), "\n")
After including the text features (number of adjectives and cluster label), the accuracy is not improved. The reason might be that the regression treats the cluster label as a numeric value instead of a factor. We could further turn the cluster label into ten extra indicator columns, with 0/1 entries indicating whether a review belongs to a certain cluster.
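A sketch of that suggested encoding (cluster labels here are hypothetical; pandas get_dummies performs the expansion):

```python
import pandas as pd

df = pd.DataFrame({"cluster": [3, 0, 3, 7]})
# Expand the integer cluster label into 0/1 indicator columns so a model
# does not treat it as an ordered numeric value.
dummies = pd.get_dummies(df["cluster"], prefix="cluster")
df = pd.concat([df.drop("cluster", axis=1), dummies], axis=1)
```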
Below is the code for multiclass random forest and two-layer classification. The accuracies are 80.07% and 82.81%, respectively. Here we only demonstrate random forest on shopping reviews because of hardware limitations.
%%R
### multiclass classification using RF
library(randomForest)
data = read.csv('YelpCsv/yelp_review_train.csv', head=TRUE)
data$quality = ifelse(data$votes_useful > 0 , ifelse(data$votes_useful >= 5 , 2, 1), 0)
data <- data[ which(data$Shopping_index=='TRUE'), ]
cols = names(data)[10:24]
cols = c(cols, names(data)[47:91])
remove <- c ("year","num_smiley", "num_url", "b.min_funny", "num_nums", "num_ellipsis", "num_bracketsPair", "num_asterisk", "num_colon", "num_dollarSign", "yesOrNo")
cols=cols[!cols %in% remove]
train = na.roughfix(data[,cols])
train$quality = data$quality # attach the class label (0 = Low, 1 = Medium, 2 = High)
p <- randomForest(factor(quality) ~ ., data = train, nodesize = 5, importance = T, proximity = TRUE, ntree=500, do.trace = T)
test = read.csv('YelpCsv/yelp_review_test.csv', head=TRUE)
test$quality = ifelse(test$votes_useful > 0 , ifelse(test$votes_useful >= 5 , 2, 1), 0)
test <- test[ which(test$Shopping_index=='TRUE'), ]
test[,cols] = na.roughfix(test[,cols])
tp = predict(p, test[,cols])
accuracy = mean(test$quality == tp)
### two layers classification using RF
data = read.csv('YelpCsv/yelp_review_train.csv', head=TRUE)
data$yesOrNo = ifelse(data$votes_useful > 0 , 1, 0)
data$quality = ifelse(data$votes_useful > 0 , ifelse(data$votes_useful >= 5 , 2, 1), 0)
data <- data[ which(data$Shopping_index=='TRUE'), ]
cols = names(data)[12:24]
cols = c(cols, names(data)[47:92])
remove <- c ("num_smiley", "num_url", "b.min_funny", "num_nums", "num_ellipsis", "num_bracketsPair", "num_asterisk", "num_colon", "num_dollarSign")
cols=cols[!cols %in% remove]
data[,cols] = na.roughfix(data[,cols])
p <- randomForest(factor(yesOrNo) ~ ., data = data[,cols], nodesize = 5, importance = T, proximity = TRUE, ntree=500, do.trace = T)
test = read.csv('YelpCsv/yelp_review_test.csv', head=TRUE)
test[is.na(test)] = 0
test$yesOrNo = ifelse(test$votes_useful > 0 , 1, 0)
test$quality = ifelse(test$votes_useful > 0 , ifelse(test$votes_useful >= 5 , 2, 1), 0)
test <- test[ which(test$Shopping_index=='TRUE'), ]
tp = predict(p, test[,cols])
test1 = test[tp == '1',]
data1 = data[which(data$quality > 0),]
cols = cols[!cols=="yesOrNo"]
cols = c(cols, remove, "quality")
remove <- c ("num_smiley", "num_url", "u.max_stars", "u.min_stars" ,"b.max_stars", "b.min_cool", "b.min_useful", "b.sd_stars", "b.median_cool", "b.avg_funny", "b.avg_stars", "num_checkin", "num_asterisk")
cols = cols[!cols %in% remove]
p1 <- randomForest(factor(quality) ~ ., data = data1[,cols], nodesize = 5, importance = T, proximity = TRUE, ntree=500, do.trace = T)
tp1 = predict(p1, test1[,cols])
index0 =which(tp == 0)
trest = test$quality[-index0]
#Reviews predicted Low are correct if their true quality is 0;
#the rest are correct if the second-layer prediction matches.
accuracy = (sum(test$quality[index0] == 0) + sum(trest == tp1)) / nrow(test)
The result is shown below.
%%R
load("/Users/shali/Documents/CS249/Midterm/CS249_Project/RFresult.RData")
plot(p)
print(p$importance)
The figure shows that the error converges after about 400 trees.
We plot the feature importance for multiclass classification below. The most important feature is the user's median number of useful votes. The number of smileys and the number of URLs have a negative effect on classification.
from IPython.display import Image
Image(filename='/Users/shali/Documents/CS249/Midterm/CS249_Project/feature.png',
embed=True)
We tried using the two most important features from different categories (useful votes and cool votes) to classify reviews, with the result shown below. The color indicates the class. The result shows that these two features dominate the classification.
from IPython.display import Image
Image(filename='/Users/shali/Documents/CS249/Midterm/CS249_Project/classificationBy2feature.jpeg',
embed=True)
Next we directly regress the number of useful votes so that we can compare our RMSLE with the results of a Kaggle competition. Our result is 0.37843!
%%R
### Regression using RF
data = read.csv('YelpCsv/yelp_review_train.csv', head=TRUE)
data1 <- data[ which(data$Shopping_index=='TRUE'), ]
cols = names(data)[10:24]
cols = c(cols, names(data)[47:91])
remove <- c ("year","num_smiley", "num_url", "b.min_funny", "num_nums", "num_ellipsis", "num_bracketsPair", "num_asterisk", "num_colon", "num_dollarSign", "yesOrNo")
cols=cols[!cols %in% remove]
train = na.roughfix(data1[,cols])
p <- randomForest(votes_useful ~ ., data = train, nodesize = 5, importance = T, proximity = TRUE, ntree=500, do.trace = T)
tp = predict(p, na.roughfix(test[,cols]))
rmsle(tp, test$votes_useful) # rmsle() from the 'Metrics' package
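For reference, RMSLE as used in the Kaggle comparison can be written in Python as follows (matching the usual definition; we assume the R rmsle comes from the Metrics package):

```python
import numpy as np

def rmsle(pred, actual):
    # sqrt(mean((log(1 + pred) - log(1 + actual))^2)); symmetric in its args
    pred, actual = np.asarray(pred, float), np.asarray(actual, float)
    return float(np.sqrt(np.mean((np.log1p(pred) - np.log1p(actual)) ** 2)))
```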
%%R
load("/Users/shali/Documents/CS249/Midterm/newRF.RData")
%%R
print(p$importance)
from IPython.display import Image
Image(filename='/Users/shali/Documents/CS249/Midterm/result2.png',
embed=True)
from IPython.display import Image
Image(filename='/Users/shali/Documents/CS249/Midterm/result1.png',
embed=True)
By using two-layer binary classification, we reach 84.25% accuracy, which is fairly good. To compare our results with the Kaggle Yelp competition, we also did regression using random forest, and our RMSLE score for the shopping category is 0.37843, better than the best Kaggle result. Investigating why our result is so much better, we found that in the Kaggle contest the user data is incomplete: some user tuples are missing due to the privacy policy, and in the test data some user information is removed. In our case, we used the academic dataset, and our most important features are user-level features. Moreover, we first divided the data into categories and then did regression; because data within one category tends to be consistent, the regression yields a lower RMSLE. Since we have not looked into other categories, we do not know whether their RMSLE would be as good as the shopping category's.
In this project, we accomplished our goal of predicting review quality by extracting features from the Yelp Academic Dataset, performing text analysis, and running multiclass classification. As a result, we achieved good prediction accuracy.
The Yelp Academic Dataset from the greater Phoenix, AZ metropolitan area is used as the raw data. First, the data covers all aspects of reviews (business, check-in, user, tip, and review records), providing complete review information at both the user and business level. Second, the data spans a long period and also includes fresh reviews from 2014; based on review age, we could easily split the dataset into training and test sets for prediction purposes. The raw data is complete and comprehensive, which is ideal for the subsequent feature extraction.
With only a few attributes usable directly, we delved into the dataset to derive our own features. Since the raw data is large, feature extraction was very time-consuming. We employed a map-reduce approach and other statistical methods to compute numerical features, used regular expressions to capture characteristics of the text, and applied the nltk module to analyze the review text, examining the word classes used and clustering reviews by text similarity. Handling this amount of text was tedious, but we finally obtained 61 useful features.
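As an illustration of the regular-expression side of this step, a few surface-level text characteristics can be extracted like so; the feature names here are hypothetical examples, not the project's actual 61 features:

```python
import re

def text_features(review_text):
    """Illustrative text-characteristic features via regular expressions."""
    words = re.findall(r"[A-Za-z']+", review_text)
    return {
        "num_words": len(words),
        "num_exclamations": review_text.count("!"),
        # Words written entirely in capitals, often a marker of emphasis
        "num_all_caps": sum(1 for w in words if len(w) > 1 and w.isupper()),
        "avg_word_len": sum(map(len, words)) / len(words) if words else 0.0,
    }

feats = text_features("GREAT food! The service was AMAZING!!")
```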
After acquiring the features, we spent considerable time verifying their correctness and merging them all into one data frame for further investigation. We used parallel-coordinate plots to visualize the data by class and observed that our features contribute to the classification problem. During feature extraction we analyzed the similarity of review texts and drew a network to show their connections and relationships. For classification, after investigating various classification and regression methods, we chose gradient boosting and random forest as our two main approaches because of their high reliability in terms of accuracy. Within each method we tried several classification schemes, e.g. multiclass vs. two-layer. Moreover, we made great efforts to study the relative importance of features to help improve test accuracy. Other statistical methods such as LDA and clustering were also utilized in our project.
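The two-layer scheme mentioned above can be sketched as a cascade of two binary classifiers; the exact layering ('Low' vs. useful, then 'Medium' vs. 'High') and the threshold stand-ins below are illustrative assumptions, since the real layers would be trained gbm or randomForest models:

```python
def make_two_layer_classifier(layer1, layer2):
    """Cascade two binary classifiers into a three-class predictor.
    layer1 decides 'Low' vs. useful; layer2 splits the useful reviews
    into 'Medium' vs. 'High'. Both are any callables returning a truthy
    value for the positive class."""
    def predict(x):
        if not layer1(x):
            return "Low"
        return "High" if layer2(x) else "Medium"
    return predict

# Toy stand-ins: threshold a single (hypothetical) numeric feature
clf = make_two_layer_classifier(lambda x: x > 0.5, lambda x: x > 2.0)
labels = [clf(v) for v in (0.1, 1.0, 3.0)]  # → ['Low', 'Medium', 'High']
```

One appeal of this design is that each layer is a plain binary problem, so class imbalance and loss functions can be tuned per layer instead of once for all three classes.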
The IPython notebook is presented in an organized and coherent way. We first introduce each idea and then show the R script with comments. At each step, we print useful results together with meaningful plots; multiple plotting tools, such as ggplot, were used for clear demonstration.
Our project is meaningful in giving Yelp users and businesses a guide to writing and finding useful reviews. Based on it, useful reviews could be recommended and promoted to the top of the review list, helping users make wise decisions more quickly and easily. We applied various data mining tools to a real problem and achieved good prediction accuracy, challenging ourselves to handle a large dataset and to explore diverse methods for improving the results, demonstrating solid analysis and problem-solving skills.
Through this project, we gained a deeper understanding of how to use data mining tools to solve practical problems, for example data preprocessing, feature extraction, visualization, and analysis. Beyond the ability to use the tools, we developed strong intuition about data analysis, which matters greatly before applying any strategy: given a 'good' classification method, how to investigate the data in detail (e.g., relative feature importance, class balance, loss function choice) and make the most of the model itself. We also gained good experience documenting work with the IPython notebook and enjoyed the teamwork.
https://www.kaggle.com/c/yelp-recruiting
http://en.wikipedia.org/wiki/Gradient_boosting
http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
\(\textit{Thank you and have a great SUMMER!:)}\)